Introducing Multiple Regression
install.packages(c("effects","car","stargazer"))
library(effects); library(car); library(stargazer)
library(ggplot2); library(gvlma)
The HR Department needs a hiring diagnostic for choosing among applicants for employment. The goal is to maximize the productivity of the employees we choose by building and deploying a model that predicts and explains worker productivity. We are to explain With as a function, potentially, of the other measured variables, summarized below:
| Variable | Missing | Complete | Mean | SD | Min | P25 | Median | P75 | Max | Hist |
|---|---|---|---|---|---|---|---|---|---|---|
| Subject | 0 | 1 | 18.5000 | 10.53565 | 1 | 9.75 | 18.5 | 27.25 | 36 | ▇▇▇▇▇ |
| General | 0 | 1 | 179.5556 | 54.06105 | 86 | 134.50 | 194.0 | 224.50 | 269 | ▆▅▃▇▅ |
| Dexterity | 0 | 1 | 176.0833 | 44.81860 | 86 | 143.50 | 173.5 | 213.50 | 254 | ▂▇▆▆▆ |
| Without | 0 | 1 | 104.0000 | 12.98131 | 72 | 94.00 | 104.5 | 114.00 | 136 | ▁▇▇▇▁ |
| With | 0 | 1 | 109.0000 | 12.24745 | 84 | 99.00 | 109.0 | 119.00 | 134 | ▅▆▇▅▂ |
- With the device?
- Without the device?
Diff.G.0 <- (Hiring$With - Hiring$Without > 0)
ggplot(Hiring, aes(x=Subject, ymin=Without, ymax=With)) +
  geom_linerange(aes(color=Diff.G.0)) +
  geom_point(aes(x=Subject, y=With), color="green", size=2, alpha=0.3) +
  geom_point(aes(x=Subject, y=Without), color="red", size=2, alpha=0.3) +
  theme_minimal() +
  labs(color="With Bigger?")

So the device works. It generates 3.76 to 6.24 units per period above and beyond Without. That means, at most, 3.14 periods.
One Sample t-test
data: Hiring$With - Hiring$Without
t = 8.2041, df = 35, p-value = 1.147e-09
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
3.762752 6.237248
sample estimates:
mean of x
5
[1] 3.970291
[1] 3.148384
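The confidence interval printed above can be rebuilt by hand from nothing but the reported values: the standard error of the mean difference is the mean divided by the t statistic, and the 95 percent bounds are the mean plus or minus the 0.975 quantile of a t distribution on 35 degrees of freedom times that standard error. A minimal sketch:

```r
# Reconstruct the 95% CI for the mean With - Without difference
# using only the values printed by t.test().
m  <- 5        # sample mean of the differences
tv <- 8.2041   # reported t statistic
n  <- 36       # number of subjects
se <- m / tv                               # implied standard error
ci <- m + c(-1, 1) * qt(0.975, n - 1) * se
round(ci, 3)   # ≈ 3.763 6.237, matching the printed interval
```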
Let’s get a feel for the data. How about correlation? It’s pretty but not that useful.
Remove Subject and we are good to go.
Our focus will be the bottom row.
Call:
lm(formula = Without ~ General, data = Hiring)
Residuals:
Min 1Q Median 3Q Max
-19.0342 -2.8739 -0.6903 3.7688 12.8486
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 65.55449 3.49138 18.78 < 2e-16 ***
General 0.21411 0.01864 11.49 2.99e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 5.962 on 34 degrees of freedom
Multiple R-squared: 0.7951, Adjusted R-squared: 0.7891
F-statistic: 131.9 on 1 and 34 DF, p-value: 2.99e-13
Call:
lm(formula = Without ~ Dexterity, data = Hiring)
Residuals:
Min 1Q Median 3Q Max
-39.655 -4.390 0.013 3.825 28.814
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 76.99821 7.65028 10.065 9.89e-12 ***
Dexterity 0.15335 0.04214 3.639 0.000899 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 11.17 on 34 degrees of freedom
Multiple R-squared: 0.2803, Adjusted R-squared: 0.2591
F-statistic: 13.24 on 1 and 34 DF, p-value: 0.000899
A few answers. What’s the question?
If predictive accuracy is the goal, the residual standard error is larger for Dexterity (11.17) than for General (5.962).
If explanatory power is the goal, the R² is much smaller for Dexterity (0.280) than for General (0.795).
The graphics suggest that General provides a better fit: the points are typically closer to the line in vertical distance, which implies both of the above observations.
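A quick numerical check on the two summaries above: with a single predictor, the overall F statistic is simply the square of the slope's t statistic.

```r
# With one predictor, F = t^2 (up to rounding in the printed output).
t_general   <- 11.49   # t for General   in Without ~ General
t_dexterity <- 3.639   # t for Dexterity in Without ~ Dexterity
t_general^2     # ≈ 132.0 (printed F: 131.9)
t_dexterity^2   # ≈ 13.24 (printed F: 13.24)
```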
Let’s look up F. In the example above, with only one input, F is determined by R² alone; indeed, it is the square of the slope’s t statistic. More generally, F is a ratio of two \chi^2 variables (equivalently, two appropriately scaled sums of squared normals), each divided by its degrees of freedom. Because numerator and denominator involve the same response variable on the same scale, the unknown variance cancels in the ratio. In the simplest of terms, F is the average explained variation divided by the average unexplained variation. What are we averaging over? Degrees of freedom, from a baseline of n − 1.
Model.Dex <- lm(Without ~ Dexterity, data=Hiring)
Model.Gen <- lm(Without ~ General, data=Hiring)
anova(Model.Dex)

Analysis of Variance Table
Response: Without
Df Sum Sq Mean Sq F value Pr(>F)
Dexterity 1 1653.2 1653.23 13.242 0.000899 ***
Residuals 34 4244.8 124.85
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Analysis of Variance Table
Response: Without
Df Sum Sq Mean Sq F value Pr(>F)
General 1 4689.5 4689.5 131.94 2.99e-13 ***
Residuals 34 1208.5 35.5
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
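The second table reproduces both the F statistic and the R² from the earlier summary of Without ~ General; a sketch using only the printed sums of squares:

```r
# F is the mean square explained over the mean square unexplained;
# R^2 is the explained share of the total sum of squares.
ss_model <- 4689.5   # Sum Sq for General,  df = 1
ss_resid <- 1208.5   # Residual Sum Sq,     df = 34
F_stat <- (ss_model / 1) / (ss_resid / 34)
R2     <- ss_model / (ss_model + ss_resid)
F_stat   # ≈ 131.9, matching the ANOVA table
R2       # ≈ 0.795, matching the regression summary
```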
Dependent variable: With

|  | (1) | (2) | (3) |
|---|---|---|---|
| General | 0.219*** |  | 0.095*** |
|  | (0.019) |  | (0.018) |
| Dexterity | -0.009 | 0.059*** |  |
|  | (0.023) | (0.012) |  |
| Without |  | 0.797*** | 0.551*** |
|  |  | (0.041) | (0.075) |
| Constant | 71.286*** | 15.670*** | 34.560*** |
|  | (2.935) | (3.664) | (5.148) |
| Observations | 36 | 36 | 36 |
| R² | 0.888 | 0.955 | 0.957 |
| Adjusted R² | 0.881 | 0.952 | 0.955 |
| Residual Std. Error (df = 33) | 4.223 | 2.682 | 2.607 |
| F Statistic (df = 2; 33) | 130.667*** | 348.301*** | 369.641*** |

Note: *p<0.1; **p<0.05; ***p<0.01
All three sets of residuals are plausibly normal:
Shapiro-Wilk normality test
data: Mod.DG$residuals
W = 0.96384, p-value = 0.2816
Shapiro-Wilk normality test
data: Mod.DW$residuals
W = 0.98128, p-value = 0.7876
Shapiro-Wilk normality test
data: Mod.GW$residuals
W = 0.9639, p-value = 0.2828
If the residuals are not normal, then the slopes are not t-distributed and the ratios of variances are not F-distributed; everything we might want to use the regression for falls apart. For now, at least.
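As a quick illustration of what the Shapiro-Wilk test is doing (on deterministic toy vectors, not the Hiring residuals): an ideally normal-looking sample passes, while one gross outlier fails.

```r
# Toy demonstration of shapiro.test(); no randomness involved.
x_norm <- qnorm(ppoints(36))   # ideal normal scores, n = 36
x_out  <- c(x_norm[-36], 10)   # replace the largest value with an outlier
shapiro.test(x_norm)$p.value   # large: no evidence against normality
shapiro.test(x_out)$p.value    # tiny: normality clearly rejected
```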
The flagged (most extreme) observations for each model:

[1] 24 31
[1] 16 13
[1] 16 28
Call:
lm(formula = With ~ General + Dexterity, data = Hiring)
Coefficients:
(Intercept) General Dexterity
71.286437 0.218684 -0.008816
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = Mod.DG)
Value p-value Decision
Global Stat 7.3824 0.11701 Assumptions acceptable.
Skewness 2.6704 0.10223 Assumptions acceptable.
Kurtosis 0.1628 0.68659 Assumptions acceptable.
Link Function 3.6686 0.05545 Assumptions acceptable.
Heteroscedasticity 0.8806 0.34804 Assumptions acceptable.
Call:
lm(formula = With ~ Dexterity + Without, data = Hiring)
Coefficients:
(Intercept) Dexterity Without
15.67018 0.05927 0.79705
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = Mod.DW)
Value p-value Decision
Global Stat 2.8349 0.5858 Assumptions acceptable.
Skewness 0.9045 0.3416 Assumptions acceptable.
Kurtosis 0.7883 0.3746 Assumptions acceptable.
Link Function 1.0035 0.3165 Assumptions acceptable.
Heteroscedasticity 0.1387 0.7096 Assumptions acceptable.
Call:
lm(formula = With ~ General + Without, data = Hiring)
Coefficients:
(Intercept) General Without
34.55964 0.09543 0.55101
ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
Level of Significance = 0.05
Call:
gvlma(x = Mod.GW)
Value p-value Decision
Global Stat 3.739e+00 0.4425 Assumptions acceptable.
Skewness 5.974e-04 0.9805 Assumptions acceptable.
Kurtosis 2.699e+00 0.1004 Assumptions acceptable.
Link Function 3.815e-05 0.9951 Assumptions acceptable.
Heteroscedasticity 1.039e+00 0.3080 Assumptions acceptable.
From our chosen model, a few cool things….
For a new applicant with General = 119 and Without = 86:

Confidence interval for the mean response:

fit lwr upr
1 93.3029 91.76003 94.84578

Prediction interval for a single applicant:

fit lwr upr
1 93.3029 87.77848 98.82733
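These two intervals, the narrower one for the mean response and the wider one for a single applicant, come from predict() with interval = "confidence" and interval = "prediction". A self-contained sketch of the same call pattern on simulated stand-in data (the Hiring data itself is assumed, so the numbers here will differ):

```r
# Confidence vs. prediction intervals from predict.lm();
# simulated data whose coefficients loosely mimic the chosen model.
set.seed(1)
d <- data.frame(General = rnorm(36, 180, 54), Without = rnorm(36, 104, 13))
d$With <- 35 + 0.10 * d$General + 0.55 * d$Without + rnorm(36, sd = 2.6)
m <- lm(With ~ General + Without, data = d)
new_applicant <- data.frame(General = 119, Without = 86)
predict(m, newdata = new_applicant, interval = "confidence")  # mean response
predict(m, newdata = new_applicant, interval = "prediction")  # one applicant
```

The prediction interval is always the wider of the two, because it adds the residual noise of a single observation to the uncertainty in the fitted mean.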
The comparison F is the ratio of two boxes (sums of squares), each normalized by its degrees of freedom.
Mod.WithWithout <- lm(With ~ Without, data=Hiring)
anova(Mod.WithWithout, lm(With ~ Without + General, data=Hiring))

Analysis of Variance Table
Model 1: With ~ Without
Model 2: With ~ Without + General
Res.Df RSS Df Sum of Sq F Pr(>F)
1 34 415.21
2 33 224.34 1 190.87 28.078 7.638e-06 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
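The F in this comparison can be rebuilt from the two residual sums of squares: the drop in RSS per degree of freedom spent, divided by the remaining RSS per residual degree of freedom. A sketch using the printed values:

```r
# Nested-model F test by hand: (RSS drop / df spent) / (RSS_big / df_big).
rss_small <- 415.21   # With ~ Without            (df = 34)
rss_big   <- 224.34   # With ~ Without + General  (df = 33)
F_stat <- ((rss_small - rss_big) / 1) / (rss_big / 33)
F_stat  # ≈ 28.08, matching the table
```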
Analysis of Variance Table
Model 1: With ~ Without
Model 2: With ~ Without + Dexterity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 34 415.21
2 33 237.46 1 177.75 24.702 2.014e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
And there is still something to be learned at the margins.
Analysis of Variance Table
Model 1: With ~ Without + General
Model 2: With ~ Without + General + Dexterity
Res.Df RSS Df Sum of Sq F Pr(>F)
1 33 224.34
2 32 191.46 1 32.875 5.4946 0.02545 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
A single plot for HR to deploy.
What should we do? We could always reconstruct the analysis. It is surprisingly easy; think about it for a second.
DADM: Gaming Devices